Abstract

Rumor source identification in large social networks has received significant attention lately. Most recent works deal with the scale of the problem by observing a subset of the nodes in the network, called sensors, to estimate the source. This paper addresses the problem of locating the source of a rumor in large social networks where some of these sensor nodes have failed. We estimate the missing information about the sensors using doubly non-negative (DN) matrix completion and compressed sensing techniques. This is then used to identify the actual source by using a maximum likelihood estimator we developed earlier, on a large data set from Sina Weibo. Results indicate that the estimation techniques result in almost as good a performance of the ML estimator as for the network for which complete information is available. To the best of our knowledge, this is the first research work on source identification with incomplete information in social networks.

1 Introduction

In April 2013, hackers took control of the Twitter account @AP and sent a fake tweet about explosions in the White House. U.S. financial markets were spooked by this tweet; the S&P 500 index dropped 14 points, wiping out $136.5 billion in a matter of seconds before the financial markets recovered. At a time when cybersecurity has become a major national issue, the ease with which rumors spread through social networks has exacerbated concerns. More specifically, studies show that rumors spread much faster in social networks than in other types of networks, even faster than in networks with a complete graph topology. Therefore, it is of great interest to pinpoint the source of a rumor in time by leveraging the social network topology and observing the state of nodes. The practical applications include rapid damage control and understanding the role of network structure in rumor dissemination, thereby facilitating the design of sophisticated policies to prevent further viral spreading of misinformation through social networks in the future.

The various approaches to locating the source of a rumor may be classified based on whether they rely on observing all the nodes in the social network or a fraction of the nodes in the social network [10, 11]. It is impractical to observe all the nodes in the social network due to the large amount of computational complexity that is involved. One means to deal with the complexity issues is by selecting a subset of nodes (also called sensors) [10, 11]. In [10], a maximum likelihood (ML) estimator was proposed using measurements by the sensors. It was shown that an average source localization error of less than 4 hops can be achieved by observing 20% of the nodes in the network. In [11], we proposed a two-stage source localization algorithm that required 3% fewer sensor nodes to provide measurements on the time of arrival of information, and yet provided results with the same accuracy as previous studies.

In most practical scenarios, it is not possible to observe the status of all nodes in a large-scale social network. Therefore, the source of a rumor must be located based on the measurements collected by a subset of nodes (called sensors) in the social network. The sensors record the arrival times of the rumor to estimate the most likely source. However, in most practical scenarios, we may not have complete information on the time at which the sensors receive the rumor. This could happen because most social networks such as Twitter do not provide public access to their full stream of tweets and many Facebook users keep their activity and profiles private. Overall, the rapid growth of the social networks themselves, and the increasing volume of their generated data, will likely aggravate the problem of missing data in the study of rumor diffusion. This paper presents a technique to locate the source of a rumor in large social networks where the information on the time at which the sensors receive the rumor is incomplete.

Using incomplete information to estimate the source of rumors is achieved by recovering the missing information. Such data recovery was addressed in the context of computer networks [12] and sensor networks [13]. In the context of social networks, recovering the missing information is essentially a matrix completion problem. There are several approaches to matrix completion [14]. We deploy compressed sensing and doubly non-negative (DN) matrix completion [16-18] to recover the missing information on the time epochs at which certain sensors receive the information. We also present a renewal theory-based argument to improve the DN completion based estimation. We use the estimated values to identify the source of rumors using a maximum likelihood (ML) estimator we developed in [11]. Results indicate that these estimation methods provide almost as good a source identification performance as that obtained when complete information is available.

The rest of the paper is organized as follows. In Section 2, the rumor diffusion model and the source estimator are discussed. Section 3 presents different approaches to recover missing values at the sensors using compressed sensing, DN matrix completion, and a renewal theory-based model. Experimental results are provided in Section 4 and conclusions in Section 5.

2 Source Identification 


The source identification mechanism we designed in [11] is as follows.¹ A social network can be modeled as a graph $G(V, E)$, where a vertex $v \in V$ represents a user in the network and two vertices $u, v \in V$ share an edge if the corresponding users share a friendship or any similar relation. Whenever a user tweets or posts a message (or a rumor), the people following the user or the friends of the user may re-tweet or re-post the rumor. Let $t_{mn}$ be the delay between the epochs at which nodes $m$ and $n$ get "infected"² by a rumor. Then $t_{mn} \sim \mathcal{N}(\mu_{mn}, \sigma_{mn}^2)$ [10]. The parameters $\mu_{mn}$ and $\sigma_{mn}$ depend on the path between the nodes $m$ and $n$. A source, $s^*$, starts a rumor and spreads it on a social network. The information diffuses through the network and reaches nodes $v \in V$ along the shortest path from $s^*$ to $v$. The goal is to determine $s^*$ given the time epochs at which nodes $v \in S \subset V$ (the set of sensor nodes) receive the information.

The source localization algorithm consists of two stages. In the first stage, the cluster that most likely contains the source of the rumor is identified and then, in the second stage, we search within this cluster and identify the source of the rumor. A new graph $G^{gate}(V^{gate}, E^{gate})$ is constructed, where $V^{gate}$ is the set of gateway nodes (nodes connecting clusters using between-cluster ties) and $E^{gate}$ is incident on the vertices in $V^{gate}$. Let $S = \{s_i\}_{i=1}^{k_1}$ be a set of $k_1$ nodes, selected from $V^{gate}$, to observe the time of arrival of the rumor. Since the time at which the source starts to spread information, $t^*$, is typically unknown, inter-arrival times, $\Delta t_i = (t_i + t^*) - (t_1 + t^*) = t_i - t_1$, can be used for estimation, where $t_i$ is the time at which the rumor is received at the $i$-th sensor in $S$. The inter-arrival time observation vector is then defined as $\Delta t^{stage1} = [\Delta t_2, \Delta t_3, \ldots, \Delta t_{k_1}]^T$.³ Since all the nodes are equally likely to be the source of a rumor, the maximum likelihood (ML) estimator is the optimal estimator for the source of the rumor, described as

$$\hat{s}^{(1)} = \arg\max_{s \in V^{gate}} \frac{1}{(2\pi)^{(k_1-1)/2} \det(\Lambda_s)^{1/2}} \exp\left(-\frac{1}{2}\left(\Delta t^{stage1} - \mu_s\right)^T \Lambda_s^{-1} \left(\Delta t^{stage1} - \mu_s\right)\right) \tag{1}$$

where $\mu_s(r)$ is the mean value of the difference in arrival times between the first and the $(r+1)$-th sensors and $\Lambda_s(a, b)$ is the cross-correlation matrix of the difference in arrival times between the $a$-th and the $b$-th sensors.

In the second stage, the search space will be limited to the nodes inside the cluster that is associated with $\hat{s}^{(1)}$. Let $G^{cluster}(V^{cluster}, E^{cluster})$ denote the subgraph formed by the nodes inside the most likely candidate cluster. $k_2$ sensors are employed at this stage to collect information about the rumor. The corresponding optimal ML estimator is given by

$$\hat{s}^{(2)} = \arg\max_{s \in V^{cluster}} \frac{1}{\det(\Lambda_s)^{1/2}} \exp\left(-\frac{1}{2}\left(\Delta t^{stage2} - \mu_s\right)^T \Lambda_s^{-1} \left(\Delta t^{stage2} - \mu_s\right)\right) \tag{2}$$

where $\Delta t^{stage2}$ is the observation vector in the second stage and $\hat{s}^{(2)}$ is the estimated source of the rumor. Detailed information about the two-stage algorithm can be found in [11]. It is observed that the ML source estimator requires full information about the vectors $\Delta t^{stage1}$ and $\Delta t^{stage2}$.

In most practical scenarios, we may not have the entire information on the time epochs at which different sensors receive the information. This could be because most social networks such as Twitter do not provide public access to their full stream of tweets and most Facebook users keep their activity and profiles private. In the following section, we present estimation mechanisms that enable source identification when only incomplete information about $\Delta t^{stage1}$ and $\Delta t^{stage2}$ is available.

¹ The details can be found in [11], but we present the key results here to make this paper easier to read.

² By "infected", we mean that a node not only receives a rumor but also re-posts or re-tweets it because he/she believes the rumor.

³ $(\cdot)^T$ represents the transpose of a vector or a matrix.

3 Source Identification with Partial Information

We present three different approaches to recover the missing information, which, in turn, will be used in the analysis to identify the source, as detailed in Section 2. First, we present a compressed sensing based approach (Section 3.1) that is effective in recovering sporadically missing information. Then we present an approach using doubly non-negative (DN) matrix completion to recover information missing in bursts, in Section 3.2. Then, in Section 3.3, we improve the DN completion mechanism by using a renewal theory-based analysis.

3.1 Compressed Sensing

Consider an observation vector $\Delta t = [\Delta t_1, \Delta t_2, \ldots, \Delta t_K]^T$, where $\Delta t_i$ corresponds to the difference in arrival times between the $i$-th sensor and the reference sensor. Let $y \in \mathbb{R}^L$ be the vector of available entries in $\Delta t$, where $L < K$. Hence,
$$y = \Phi \Delta t \tag{3}$$

where $\Phi$ is an $L \times K$ measurement matrix. In order to recover the original observation vector $\Delta t$ from $y$, we assume that there exists an invertible $K \times K$ sparsifying matrix $\Psi$ such that

$$\Delta t = \Psi x \tag{4}$$

where $x \in \mathbb{R}^K$ is $M$-sparse with $M < K$, i.e., it has only $M$ non-zero entries. Using Eqn. (3) and Eqn. (4) we can write

$$y = \Phi \Delta t = \Phi \Psi x = \Theta x \tag{5}$$

which is, in general, an ill-posed and ill-conditioned problem, with $\Theta = \Phi \Psi$ of dimensions $L \times K$. Infinitely many solutions are possible unless we impose some additional constraints on $\Delta t$. Since the rumor spreads along the shortest paths in the social network, the observation vector shows some amount of correlation among its elements. The correlation structure of the observation vector makes it possible to acquire sufficiently accurate representations of the observation vector without collecting arrival times from each sensor node. Therefore, the vector $\Delta t$ can be approximated through a sparse vector $x$, and the problem becomes

$$x_{opt} = \arg\min_{x} \|x\|_{\ell_1} \quad \text{s.t.} \quad y = \Theta x \tag{6}$$

where $\|x\|_{\ell_1}$ is the $\ell_1$-norm of $x$.

Ultimately the estimated time arrival vector $\Delta t_{est}$ is

$$\Delta t_{est} = \Psi x_{opt} \tag{7}$$

where $x_{opt}$ is the solution to the problem in Eqn. (6).

3.2 DN completion

There are certain scenarios in which the information on the time epochs of information arrival at different users corresponding to sensor nodes may be missing in bursts. This could happen because certain users that act as sensors remain idle temporarily. Let $X_{ij}$ denote the time delay between the epoch when information is propagated by sensor node $i$ and the epoch when it reaches node $j$. The delay between different nodes can then be written as a matrix, $D = [X_{ij}]_{i,j \in S}$, where $S$ is the set of all sensor nodes. Note that in $D$, $X_{ii} = 0, \forall i$.

When some nodes temporarily get de-activated or unsubscribe as a sensor, the matrix $D$ has certain entries missing. If the missing entries occur in bursts, then techniques like matrix completion can be used to determine the missing entries in the matrix $D$ [16]. These include inverse M-matrix completion [17] and doubly non-negative (DN) completion [18]. In order to perform these matrix completions, it is essential that, when the matrix $D$ is represented as a graph,⁴ the graph forms a block clique [19]. Then $D$ is symmetric and of the form

$$D = \begin{bmatrix} A & c & X \\ c^T & e & d^T \\ X^T & d & B \end{bmatrix} \tag{8}$$

where $e$ is any constant, $A$ is a known $m \times m$ sub-matrix with entries $a_{ij} = X_{ij}$, the time delay between pairs of the first $m$ sensors, $c$ is an $m \times 1$ vector, $d^T$ is a $1 \times n$ vector and $B$ is another $n \times n$ sub-matrix. An optimal mechanism for DN completion of such a matrix was discussed in [12], which we use here to estimate the missing delays between the times at which different sensor nodes obtain information. According to the analysis in [12],

$$X = D_\alpha \, c \, d^T D_\beta \tag{9}$$

where $D_\alpha = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_m)$ and $D_\beta = \mathrm{diag}(\beta_1, \beta_2, \ldots, \beta_n)$. Based on the values of $E(X) = E([X_{ij}]_{1 \le i \le m, 1 \le j \le n})$, the optimal values of $\alpha = [\alpha_1, \ldots, \alpha_m]^T$ and $\beta = [\beta_1, \ldots, \beta_n]^T$ that minimize the mean square error in estimation (i.e., the MMSE estimate) are given by the iterative set of equations [12]

$$\alpha = \frac{E(X)\,\beta}{\|\beta\|^2} \tag{10}$$

$$\beta = \frac{E(X^T)\,\alpha}{\|\alpha\|^2} \tag{11}$$

It was shown in [12] that the set of iterative equations in Eqns. (10) and (11) converges if the condition number of the covariance matrix of $X$ is less than 2. This can be satisfied by adding sufficiently large values to the diagonal elements of $D$ (i.e., by setting $d_{ii}$ to a very large value instead of 0).

⁴ In this graph, each row or column represents a vertex and two vertices are joined by an edge if the value of the corresponding entry is known.

3.3 Renewal Theory-based Model


The rumor dissemination process is depicted in Figure 1 as a function of time. The user corresponding to the $i$-th sensor first receives the information at time $X_{ij}^{(1)}$. Then the sensor is idle for a period of time $S_2$ and then relays the information at time $X_{ij}^{(1)} + S_2$. The information is received by the $j$-th sensor at time $z_{ij}^{(2)} = X_{ij}^{(1)} + S_2 + X_{ij}^{(2)}$.

In general, the $i$-th sensor receives the information for the $m$-th time at an epoch $z_{ij}^{(m)}$, stays idle for a time interval of length $S_{m+1}$ and transmits the information at an epoch $z_{ij}^{(m)} + S_{m+1}$, which is received by the $j$-th sensor for the $(m+1)$-th time at an epoch $z_{ij}^{(m)} + S_{m+1} + X_{ij}^{(m+1)}$. This can be considered as a renewal process with vacations [21].

Remark 3.1 Note that the renewal process will not be contiguous in time, as represented in Figure 1. This is because, between the epoch when sensor $i$ receives the information for the $(m-1)$-th time and the $m$-th time, there is a time delay. However, that delay will not affect the renewal theory-based analysis since we are interested in determining the average time delay between the time epochs when the $i$-th and $j$-th sensors receive the information. In other words, the blank period between the successive time instants at which the $i$-th sensor receives the information does not affect the renewal process.

The time intervals $X_{ij}^{(1)}, X_{ij}^{(2)}, \ldots$ are independent, where $X_{ij}^{(1)} \sim A_{ij}$, i.e., $\Pr\{X_{ij}^{(1)} \le x\} = A_{ij}(x)$,⁵ and $X_{ij}^{(m)} \sim F_{ij}$, i.e., $\Pr\{X_{ij}^{(m)} \le x\} = F_{ij}(x)$, $m \ge 2$.

⁵ Note that $X_{ij}^{(1)}$ depends only on $i$ and not on $j$. But we explicitly write $j$ to be consistent with $m \ge 2$.


















Figure 1: Timing diagram of the process according to which sensor $i$ receives the rumor and disseminates it to sensor $j$. Sensor $i$ is idle or inactive for times $S_m$, $m \ge 2$, and in the $m$-th dissemination attempt, the information takes a time $X_{ij}^{(m)}$ to actually reach node $j$ after it is transmitted.

Table 1: Details of dataset

                               Max      Ave      Min
Number of nodes                43,545   41,978   40,445
Number of edges                84,451   82,790   80,923
Diameter                       13       11       9
Average shortest path length   8.31     5.97     4.71
Number of clusters             223      148      103

Similarly, $S_k$, $k \ge 2$, are independent and identically distributed (iid), $S_k \sim V(x), \forall k$, i.e., $\Pr\{S_k \le x\} = V(x), \forall k$. Let $a_{ij}(x) = \frac{dA_{ij}(x)}{dx}$, $f_{ij}(x) = \frac{dF_{ij}(x)}{dx}$ and $v(x) = \frac{dV(x)}{dx}$, i.e., the probability density functions (pdf) of $X_{ij}^{(1)}$, of $X_{ij}^{(m)}$, $m \ge 2$, and of $S_k$, $k \ge 2$, are $a_{ij}(x)$, $f_{ij}(x)$ and $v(x)$, respectively.

At any time epoch $t$, let $Y_{ij}(t)$ be defined as the remaining time or residual transmission time of the information from sensor $i$ to sensor $j$. Let $Y(t) = [Y_{ij}(t)]_{i,j \in S}$, where $S$ is the set of sensors. Then, in the analysis for the DN completion described in Section 3.2, we use $\lim_{t \to \infty} E[Y(t)]$ instead of $E(X)$ in Eqns. (10) and (11). The following theorems from renewal theory will be used to characterize $E[Y(t)]$.

Theorem 3.2 Consider a renewal process where the life time of the first renewal is $X_1$. Let $X_1 \sim G(x)$, with pdf $g(x) = \frac{dG(x)}{dx}$, and $X_2, X_3, \ldots \sim H(x)$, with pdf $h(x) = \frac{dH(x)}{dx}$. Let

$$\frac{1}{\mu} \triangleq E(X) = \int_0^\infty [1 - H(x)]\,dx = \int_0^\infty x\,h(x)\,dx. \tag{12}$$

Then the cumulative distribution function (CDF) of $Y(t)$, $R(x)$, and the pdf of $Y(t)$, $r(x) = \frac{dR(x)}{dx}$, are given by

$$R(x) = \lim_{t \to \infty} \frac{1}{t} \int_{u=0}^{t} \Pr\{Y(u) \le x\}\,du \tag{13}$$

$$r(x) = \mu[1 - H(x)]. \tag{14}$$

Moreover,

$$E[Y] = \frac{\mu}{2} \int_{x=0}^{\infty} x^2 h(x)\,dx = \frac{\mu}{2} E(X_k^2), \quad k \ge 2, \tag{15}$$


which, in turn, can be re-written as

$$E[Y] = \frac{E(X_k)}{2}\left(1 + C_X^2\right), \tag{16}$$

where

$$C_X^2 = \frac{\mathrm{Var}(X_k)}{E^2(X_k)}, \quad k \ge 2. \tag{17}$$

Theorem 3.3 [21] Let $K_y(t) = \Pr\{Y(t) \le y\}$. Then

$$\lim_{t \to \infty} K_y(t) = R(y) \tag{18}$$

and

$$\lim_{t \to \infty} E[Y(t)] = \frac{E(X_k)}{2}\left(1 + C_X^2\right),$$

where $C_X^2$ is given by Eqn. (17).

From Theorems 3.2 and 3.3, the average remaining time for information to travel from one sensor to another, $E(Y)$, which we use in Eqns. (10) and (11), is

$$E(Y_{ij}) = \frac{E(X_{ij}) + E(S_i)}{2}\left[1 + \frac{\mathrm{Var}(X_{ij}) + \mathrm{Var}(S_i)}{\left[E(X_{ij}) + E(S_i)\right]^2}\right]. \tag{19}$$

In Eqn. (19), $E(X_{ij}) + E(S_i)$ is the value $E(X_{ij})$ used in DN completion in Section 3.2 without applying the renewal theory-based analysis discussed in this subsection. The expression for $E(Y_{ij})$ from Eqn. (19) is substituted in Eqns. (10) and (11) to obtain the modified DN completion using the renewal argument. The $\Delta t$ vectors estimated in Sections 3.1-3.3 are used in the ML estimator described in Section 2 to identify the source of the rumor.
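
As a quick numerical illustration of Eqn. (19), the following minimal sketch computes the renewal-corrected mean for one sensor pair. The function name `residual_mean` and the sample moments are assumptions made for illustration only; the hop delay and the idle time are taken to be independent, so their variances add, as Eqn. (19) implies.

```python
def residual_mean(mean_x, var_x, mean_s, var_s):
    """Mean residual time of one renewal cycle X_ij + S_i, following Eqn. (19)."""
    cycle_mean = mean_x + mean_s      # E(X_ij) + E(S_i)
    cycle_var = var_x + var_s         # independence assumed, so variances add
    if cycle_mean <= 0:
        raise ValueError("the cycle mean must be positive")
    cv2 = cycle_var / cycle_mean ** 2  # squared coefficient of variation
    return 0.5 * cycle_mean * (1.0 + cv2)


# Hypothetical moments: hop delay with mean 2 and variance 4, idle time fixed at 1.
corrected = residual_mean(2.0, 4.0, 1.0, 0.0)
```

With zero variance the result is half the cycle mean, and it grows with the variability of the cycle. This is the second-moment information that the plain DN completion does not use.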

4 Results and Discussion 

We conduct our experiments on the Sina Weibo dataset in [22]. Sina Weibo is the most popular microblogging service in China [23]. This dataset includes a followership network with 58,655,849 nodes and 265,580,802 edges, and a total of 370 million tweets and retweets. The retweeting paths (with their time-stamps) are provided, which makes the dataset particularly suitable for studying real information dissemination networks. We selected 100 tweets from this dataset, which constitute 100 different real diffusion networks. Table 1 summarizes the details of the dataset. We used the Louvain method [24] to identify the clusters, as the gateway nodes of these clusters are used to construct the gateway graph $G^{gate}$. Since it is assumed that rumors spread along the shortest paths in the social network, we selected nodes with high betweenness centrality as sensors.
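
The preprocessing described above can be sketched as follows, assuming the cluster labels are already available (for instance from a Louvain implementation, which is not reproduced here). The helper names `gateway_nodes`, `betweenness` and `pick_sensors` are hypothetical, betweenness is computed with Brandes' algorithm for unweighted, undirected graphs, and the toy graph is made up for illustration.

```python
from collections import deque


def gateway_nodes(adj, cluster):
    """Nodes with at least one neighbour in a different cluster."""
    return {u for u, nbrs in adj.items()
            if any(cluster[u] != cluster[v] for v in nbrs)}


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                       # BFS counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                       # back-propagate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2.0 for v, b in bc.items()}  # each pair was counted twice


def pick_sensors(adj, k):
    """The k nodes with the highest betweenness (ties broken by node id)."""
    bc = betweenness(adj)
    return sorted(bc, key=lambda v: (-bc[v], v))[:k]


# Toy graph: two triangles joined by the bridge 2-3; labels are assumed given.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cluster = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
gates = gateway_nodes(adj, cluster)
sensors = pick_sensors(adj, 2)
```

In this toy graph the two bridge endpoints are both the gateway nodes and the highest-betweenness nodes, which illustrates why betweenness is a natural sensor-selection criterion when rumors follow shortest paths.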

Figure 2 shows the accuracy of recovering the missing entries in $\Delta t$ (employing compressed sensing) versus the percentage of nodes used as sensors. The accuracy is defined as the mean square error (MSE) between the original and the estimated missing entries. We randomly remove 15% and 30% of the entries sporadically to simulate the missing measurements. As can be seen from this graph, the estimation error is smaller when the missing rate is smaller. A likely reason is that the entries remaining after 30% are removed are not sufficient to estimate the missing entries precisely. However, the error gap between the 15% and the 30% missing rates is very small when the percentage of sensors is less than 0.3%.
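
The recovery experiment can be mimicked on a toy vector with the sketch below. Two choices here are assumptions and not taken from the paper: a DCT basis stands in for the sparsifying matrix of Section 3.1, and orthogonal matching pursuit (a greedy method) replaces the $\ell_1$-norm minimization, so that the example needs only the standard library. The names `dct_basis`, `omp` and `recover` are hypothetical.

```python
import math


def dct_basis(K):
    """K x K orthonormal DCT basis; column k is the k-th cosine atom."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / K)
             * math.cos(math.pi * (2 * n + 1) * k / (2 * K))
             for k in range(K)] for n in range(K)]


def solve(A, b):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def omp(theta, y, max_atoms, tol=1e-9):
    """Greedy sparse solution of y = theta x (stand-in for the l1 program)."""
    L, K = len(theta), len(theta[0])
    cols = [[theta[i][k] for i in range(L)] for k in range(K)]
    norms = [math.sqrt(sum(c * c for c in col)) or 1.0 for col in cols]
    support, coef, resid = [], [], y[:]
    while len(support) < max_atoms and math.sqrt(sum(r * r for r in resid)) > tol:
        scores = [abs(sum(c * r for c, r in zip(cols[k], resid))) / norms[k]
                  if k not in support else -1.0 for k in range(K)]
        support.append(max(range(K), key=scores.__getitem__))
        # Least-squares refit on the selected atoms via the normal equations.
        G = [[sum(a * b for a, b in zip(cols[p], cols[q])) for q in support]
             for p in support]
        rhs = [sum(a * b for a, b in zip(cols[p], y)) for p in support]
        coef = solve(G, rhs)
        resid = [y[i] - sum(c * cols[k][i] for c, k in zip(coef, support))
                 for i in range(L)]
    x = [0.0] * K
    for c, k in zip(coef, support):
        x[k] = c
    return x


def recover(observed, K, max_atoms):
    """observed maps sensor index -> delta t; returns the full estimated vector."""
    psi = dct_basis(K)
    idx = sorted(observed)
    theta = [psi[i] for i in idx]      # Theta = Phi Psi; Phi keeps observed rows
    x = omp(theta, [observed[i] for i in idx], max_atoms)
    return [sum(psi[n][k] * x[k] for k in range(K)) for n in range(K)]


def mse(a, b):
    """Mean square error between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
```

For example, a vector built from a single DCT atom with three of its eight entries removed is recovered to within rounding error by `recover(observed, 8, max_atoms=2)`. Real arrival-time vectors are only approximately sparse, so the error reported in such an experiment depends on the missing rate.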
Figure 2: Estimation error for the observation vector $\Delta t$ when 15% and 30% of entries are missing and deploying compressed sensing (described in Section 3.1).


Figure 3: Estimation error for the observation vector $\Delta t$ when the missing rate is 0% (no missing measurements), 15%, and 30% and deploying DN matrix completion (Section 3.2) and the renewal-based argument (Section 3.3). The renewal theory-based mechanism results in lower error because it utilizes the first two moments of the missing measurements while the DN completion method uses only the first moment.


To evaluate the DN completion approach, a sub-matrix of $D$ is removed. The removed sub-matrix is chosen such that the graph representation of the partial matrix forms a block clique. Figure 3 shows the estimation error of the DN completion. The renewal-based argument yields a lower estimation error because it utilizes the first two moments of the dissemination intervals, as opposed to the DN matrix completion method, which uses only the first moment. It also shows that the accuracy improvement is larger when the missing rate is 15%.
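
The iterative step behind the DN completion can be sketched as below, reading Eqns. (10) and (11) as the alternating updates $\alpha = E(X)\beta / \|\beta\|^2$ and $\beta = E(X)^T \alpha / \|\alpha\|^2$. That reading is an interpretation, and the function name `rank_one_factors`, the all-ones initialization and the stopping rule are assumptions. How the resulting factors enter the completed block follows [12] and is not reproduced here.

```python
def rank_one_factors(EX, iters=100, tol=1e-12):
    """Alternate the two updates until alpha and beta stop changing.

    EX is the m x n matrix of expected delays for the missing block
    (either E(X) or, with the renewal argument, the residual means).
    """
    m, n = len(EX), len(EX[0])
    alpha, beta = [0.0] * m, [1.0] * n       # assumed starting point
    for _ in range(iters):
        nb = sum(b * b for b in beta)
        new_alpha = [sum(EX[i][j] * beta[j] for j in range(n)) / nb
                     for i in range(m)]
        na = sum(a * a for a in new_alpha)
        if na == 0.0:
            raise ValueError("E(X) is orthogonal to beta; change the start")
        new_beta = [sum(EX[i][j] * new_alpha[i] for i in range(m)) / na
                    for j in range(n)]
        change = max(abs(p - q)
                     for p, q in zip(new_alpha + new_beta, alpha + beta))
        alpha, beta = new_alpha, new_beta
        if change < tol:
            break
    return alpha, beta


# Hypothetical expected delays that happen to be exactly rank one.
EX = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
alpha, beta = rank_one_factors(EX)
```

At a fixed point the outer product of `alpha` and `beta` is a rank-one least-squares fit to the supplied expectation matrix, so feeding in the renewal-corrected means of Eqn. (19) instead of the plain means changes the fit without changing the iteration.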

Next, we study the accuracy of the source estimation using compressed sensing. The accuracy is measured as the average distance between the estimated and the actual sources. Figure 4 shows the source estimation error when the missing rate is 0% (no missing measurements), 15%, and 30%. It shows that, with compressed sensing, the ML estimator performs almost as well as it does on the network for which complete information is available. Figure 5 shows the source estimation error when deploying DN matrix completion and the renewal-based argument. We again observe that the renewal-based argument yields a lower estimation error in source localization. Overall, the results indicate that, with these estimation techniques, the ML estimator performs almost as well as it does when complete information is available.
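
The source-estimation step and its error metric can be sketched as follows, assuming each candidate source comes with a mean vector and a positive definite covariance matrix for the inter-arrival times, as in the Gaussian ML estimators of Section 2. The names `log_likelihood`, `ml_source` and `hop_distance` are hypothetical, and the constant $(2\pi)$ factor is dropped because it does not change the maximizer.

```python
import math
from collections import deque


def cholesky(A):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("covariance must be positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def log_likelihood(dt, mu, cov):
    """Gaussian log-likelihood of dt, up to an additive constant."""
    L = cholesky(cov)
    z = []
    for i in range(len(dt)):               # forward substitution: L z = dt - mu
        z.append((dt[i] - mu[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(dt)))
    return -0.5 * log_det - 0.5 * sum(v * v for v in z)


def ml_source(dt, candidates):
    """candidates maps node -> (mean vector, covariance matrix)."""
    return max(candidates, key=lambda s: log_likelihood(dt, *candidates[s]))


def hop_distance(adj, a, b):
    """Shortest-path length in hops between a and b (inf if disconnected)."""
    dist, queue = {a: 0}, deque([a])
    while queue:
        v = queue.popleft()
        if v == b:
            return dist[v]
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return math.inf
```

Averaging `hop_distance(adj, ml_source(dt, candidates), true_source)` over many diffusion instances gives the kind of localization error plotted in Figures 4 and 5, with `dt` being either the complete or the recovered observation vector.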



Figure 4: Average distance between the estimated source and the actual source when 15% and 30% of entries in $\Delta t$ are missing and deploying compressed sensing (described in Section 3.1).

















































































Figure 5: Average distance error between the estimated source and the actual source when 15% and 30% of entries in $\Delta t$ are missing and deploying DN matrix completion (Section 3.2) and the renewal-based argument (Section 3.3). The renewal-based argument provides a lower estimation error in source localization.


5 Conclusions 

We addressed the problem of locating the source of a rumor in large-scale social networks with incomplete measurements. We presented the compressed sensing method to recover sporadically missing measurements and the doubly non-negative (DN) completion to recover measurements missing in bursts. Furthermore, we presented a renewal theory-based model to boost the performance of the DN matrix completion method. We then used the recovered measurements to estimate the source of the rumor. We observed that compressed sensing and DN matrix completion yield a lower estimation error when the percentage of missing entries is smaller. We also showed that the renewal theory-based model improves the accuracy of the DN matrix completion method. Mechanisms to jointly improve the ML estimator and the estimation of missing measurements are under investigation.